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EFFICIENT IMPORTANCE SAMPLING FOR 
BINARY CONTINGENCY TABLES^ 

By Jose H. Blanchet 

Columbia University 

Importance sampling has been reported to produce algorithms 
with excellent empirical performance in counting problems. However, 
the theoretical support for its efflciency in these applications has been 
very limited. In this paper, we propose a methodology that can be 
used to design efficient importance sampling algorithms for count- 
ing and test their efficiency rigorously. We apply our techniques af- 
ter transforming the problem into a rare-event simulation problem — 
thereby connecting complexity analysis of counting problems with 
efflciency in the context of rare-event simulation. As an illustration 
of our approach, we consider the problem of counting the number of 
binary tables with fixed column and row sums, Cj's and r^'s, respec- 
tively, and total marginal sums d = Cj . Assuming that maxj Cj = 

o{d}^'^), = 0{d) and the r/s are bounded, we show that a suit- 

able importance sampling algorithm, proposed by Chen et al. [J. 
Amer. Statist. Assoc. 100 (2005) 109-120], requires 0(d^e"^(S"^) op- 
erations to produce an estimate that has ^-relative error with proba- 
bility 1 — (5. In addition, if maxj cj = o{d^^*~^°) for some So > 0, the 
same coverage can be guaranteed with 0(d^e~^ log(5~^)) operations. 

1. Introduction. We are interested in the complexity analysis of sequen- 
tial or state-dependent importance sampling algorithms (SIS) for counting 
problems. The development of algorithms for approximate counting in poly- 
nomial time has been a topic of great interest in theoretical computer science 
[see Valiant (1979)]. Successful techniques have been developed for efficient 
approximate counting based on the Markov Chain Monte Carlo (MCMC) 
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method [see the texts by Sinclair (1993) and Jerrum (2003) for detailed infor- 
mation on these techniques]. A different class of randomized algorithms for 
approximate counting, based on importance sampling, has received substan- 
tial attention recently [basic notions on importance sampling are discussed 
in Section 2.2; for additional background on importance sampling, see As- 
mussen and Glynn (2007) and Liu (2001)]. Chen et al. (2005) proposed an 
algorithm based on importance sampling for counting the number of bi- 
partite graphs with a given degree sequence. They tested their algorithm 
empirically and observed that it achieved excellent performance. Recently, 
Blitzstein and Diaconis (2008) have also used importance sampling algo- 
rithms for approximately counting the number of acyclic and undirected 
graphs with a given degree sequence. In addition, Rubinstein (2007) and 
Botev and Kroese (2008) have applied adaptive importance sampling algo- 
rithms to a variety of combinatorial problems, including counting and opti- 
mization. Although many of these algorithms based on importance sampling 
seem to have excellent practical performance, the theoretical framework to 
carry through a rigorous analysis of their performance is still under devel- 
opment. 

Our purpose is to illustrate a framework that can be used to design 
efficient importance sampling algorithms for counting and provide a rig- 
orous analysis of their computational complexity. Our method provides a 
direct connection between asymptotic approximations and efficient impor- 
tance sampling, and we believe that the principle underlying this connection 
can be applied in substantial generality. In order to illustrate our proposed 
techniques, we shall consider the problem of counting the number of 0-1 ma- 
trices with specified column and row sums — these types of matrices are also 
called binary contingency tables in statistical applications. In the context of 
graph theory, this problem is equivalent to that of counting the number of 
bipartite graphs with a given degree sequence. 

Returning to the problem that we consider here, we mention that sta- 
tistical analysis of binary contingency tables is a problem that has been 
motivated by several application domains, including some in biology as ex- 
plained in Chen et al. (2005). Our goal is to provide rigorous support for 
the observed experimental efficiency of a class of SIS algorithms proposed 
by Chen et al. (2005) for counting binary contingency tables. Formally, the 
problem consists in developing fast computational algorithms for counting 
the number of solutions {xij : I < i < m, 1 < j < n} to 
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Xij e {0,1}. 

Let us define d = J2iLi "^i = Z]j=i Cj- Our complexity analysis is performed by 
sending d y oo (and n, m y oo as well) in the context of sparse matrices un- 
der regularity conditions. In particular, we assume that maxj<„Cj = o{d^^'^), 
J2]=ic'j = 0{d) and that the r^'s are bounded. We shall construct and an- 
alyze a SIS-based estimator, that is, e-close (in relative terms) to the total 
number of solutions to (l)-(2) with coverage probability of at least 1 — 5 
and that requires 0{d^e~'^5~^) operations as d oo and e,(5\0 for its 
construction. Moreover, by imposing an additional growth condition of the 
form maxj<.„ Cj = o{d^/'^~^°) for some (5o > as d /' oo, we obtain that SIS 
yields a "fully polynomial randomized approximation scheme" in the sense 
that 0{d^e~'^ log(5~^)) operations are sufficient to produce an estimator that 
has e-relative precision with probability 1 — 6. 

The proposed strategy for constructing and designing SIS algorithms pro- 
ceeds as follows. The first step is to transform the counting problem into a 
so-called rare-event estimation problem; that is, the problem of estimating 
the probability of a rare event. Such probability is given by the ratio of 
the number of required solutions (xjj's) satisfying both (1) and (2) divided 
by the number of solutions satisfying only the set of constraints (2) (i.e., 
there is no restriction on the row sums), which can be easily computed. The 
second step is to recognize that such probability can be characterized by a 
system of linear equations, which is obtained by conditioning on the first 
increment of a suitably defined m-dimensional random walk. The solution 
to this system of equations provides the means of constructing the optimal 
importance sampling distribution, which is state-dependent. Such optimal 
importance sampling distribution corresponds to the so-called Doob's h- 
transform which arises in the context of positive harmonic functions. The 
next step is to use results developed by McKay (1984) [see also Greenhill, 
McKay and Wang (2006) and Bekessy, Bekessy and Komlos (1972)] that ap- 
proximate the number of solutions satisfying constraints (1) and (2) in the 
context of large and sparse matrices that we have adopted here. We then use 
these approximations (which we have extended in Theorem 1 to cover our 
assumptions) to construct an importance sampling distribution that mim- 
ics the behavior of the optimal importance sampler (thus, the better the 
approximation the closer the importance sampler to the optimal one). 

It turns out that the importance sampling algorithm suggested by the 
previous strategy coincides with one of the algorithms studied in Chen et al. 
(2005). Our results imply that in the context of large and sparse matrices 
satisfying the assumptions indicated above, the variance of the estimator 
obtained by the procedure has the best possible asymptotic performance. 
That is, the coefficient of variation of the estimator (the ratio of the stan- 
dard deviation to the probability of interest) remains bounded as d /' oo. 
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In the context of rare-event simulation, an estimator that has a bounded 
coefficient of variation is said to be strongly efficient (a notion that will 
be reviewed in Section 4). Moreover, we show that if maxj<„Cj = o^d}^^"^'') 
then the proposed estimator is exponentially efficient, a concept that is intro- 
duced in Section 4, and in particular implies strong efficiency. In particular, 
exponential efficiency allows to conclude that under our assumptions, the 
proposed SIS estimator is e-close in relative terms with coverage probability 
1 — 6 and has complexity 0{d^e~'^log{6^^)) as indicated above, in contrast 
to a complexity of order 0{d^e~'^6~^) corresponding to strong efficiency. 

A recent algorithm for counting binary tables developed by Bezakova, 
Bhatnagar and Vigoda (2006), based on MCMC techniques (simulated 
annealing), has been shown to have complexity of (roughly) 
0(d^(nm)^ maxj.j(cj, rj)) operations. However, it is important to note that 
the procedure proposed by Bezakova, Bhatnagar and Vigoda (2006) works 
in complete generality (i.e., it does not require the sparsity assumptions 
imposed here). Other algorithms based on MCMC have been devised for 
counting binary contingency tables with certain regularity conditions on the 
degree sequence (such as ours). For instance, Kim and Vu (2003) assumed 
that max(rj,Cj) = o{d^^^) and proposed an algorithm that allows to gen- 
erate an almost uniform bipartite graph (with given degree sequence) in 
time 0{d^). Kannan, Tetali and Vempala (1997) also study the problem of 
uniform generation of bipartite graphs with given degrees. 

In a recent paper [Bayati, Kim and Saberi (2007)] used ideas based on SIS 
to construct an algorithm for generation of simple graphs with a given de- 
gree sequence (a slightly different problem than the one that we study here) . 
Under regularity conditions (similar to those imposed here), they proved 
that their proposed algorithm has excellent performance for asymptotically 
uniform generation, basically linear complexity, which makes the algorithm 
optimal in the sense that no faster complexity rate is possible. Their meth- 
ods seem completely different from those developed here. In particular, we 
do not require the explicit use of concentration inequalities but instead the 
application of bounds related to Lyapunov inequalities for Markov chains. In 
addition, our methods suggest a natural way to develop efficient SIS in a va- 
riety of settings — basically if the optimal importance sampling distribution 
is described by a Markov chain and there are asymptotic approximations for 
the quantity of interest. 

It is important to emphasize that although our complexity analysis of 
SIS suggests very good performance [which is validated by the computer 
experiments performed by Chen et al. (2005)], such performance can only 
be guaranteed under certain regularity conditions. This has been noted by 
Bezakova et al. (2007) who constructed a counterexample showing that a 
SIS related to the one presented here [also proposed by Chen et al. (2005)] 
can have exponential time complexity if the degree sequence is allowed to 
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grow arbitrarily. Nevertheless, one of the main points that we intend to 
communicate is that the general method outlined here could be adapted to 
specific contexts in which the problem at hand seems to have certain regu- 
larity properties that allow to develop approximations. The strategy would 
be then to enhance the approximations by means of efficient computational 
algorithms that can be shown to have desirable complexity properties in an 
asymptotic regime related to the developed approximations. 

The basic principles behind the design and analysis of the SIS algorithms 
discussed here can be applied more broadly. For instance, Blanchet and 
Glynn (2008) apply these principles in the context of first-passage time prob- 
abilities with an emphasis on one-dimensional random walk problems with 
general heavy-tailed increments (which are of particular interest in insur- 
ance and queueing). Another example is given in Blanchet and Liu (2008), 
which develops strongly efficient rare-event simulation algorithms for large 
deviation probabilities of regularly varying random walks. The analysis of 
SIS algorithms in rare-event simulation involves constructing so-called Lya- 
punov functions, which are solutions to certain inequalities that are used in 
stability analysis of Markov processes. The use of Lyapunov inequalities in 
the context of counting problems as the one considered here is particularly 
interesting because the dimension of the state-variables of the underlying 
Markov process (in our context m) is growing. As we shall discuss in Sec- 
tion 5, the construction of a suitable Lyapunov function often requires a 
good understanding of the local likelihood ratio obtained at each step of the 
simulation. 

The rest of the paper is organized as follows. In Section 2, we relate the 
problem of counting binary contingency tables to its rare-event simulation 
counterpart. Basic notions involving importance sampling and the optimal 
change-of-measure are also discussed in Section 2. Section 3 develops asymp- 
totic approximations that are then used in the construction of the algorithm 
that we analyze. Section 4 introduces efficiency notions that are applied in 
rare-event simulation and discusses connections to related ideas used in the 
context of approximate counting. The complexity analysis of the counting 
algorithm, which leads to the proof of our main result, namely Theorem 2, 
is given in Section 5. 

2. Counting, rare-event simulation and importance sampling. A 1 ta- 
ble with specified marginals is a binary array (0-1 elements) of dimen- 
sions m X n such that the sum of the elements in the ith row equals 
(i G {1, . . . ,m}) and the corresponding sum over the jth column equals cj, 
j G {l,...,n}. 

Notational convention. Throughout the rest of the paper, we shall use 
the notation c = (ci , . . . , c„) , r = (ri , . . . , r^) and J2]=i = = J^iLi ■ In 
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addition, we shall reserve the use of boldface letters to denote vectors or 
high-dimensional objects. Random variables are denoted using capital letters 
and the use of lower case is restricted to deterministic quantities (including 
specific realizations of random objects). We also use the notation f{t) = 
0{g{t)) if there exists a constant mi G (0, oo) such that < mig{t); 

if, in addition, > m2g{t) for some m2 G (0,00), then f{t) = @{g{t)). 

Finally, we say that f{t) = o{g{t)) as t /' 00 if f{t)/g{t) — > as i 00. 

2.1. Counting binary tables via a random walk computation. We are in- 
terested in developing an importance sampling algorithm that allows to ef- 
ficiently count the number of such arrays, which we shall denote by //(r,c). 
First, note that the number of tables with m rows and given only column 
marginals c is 

,(c,m) 4 

So, the number of tables with given column and row marginals, c and r, 
respectively, can be evaluated via 

r?(c,m) •P(T(Y) = r), 

where T(Y) E is the row marginals of a table Y sampled uniformly 
over the space of tables with column marginals c [i.e., T(Y)j = Yl^=i^i,j 
for 1 < i < m] . As a consequence, the problem of counting the number of 
binary contingency tables is equivalent to that of estimating P(T(Y) = r). 
We can see that efficient estimation of this probability with good relative 
precision is not straightforward because the probability in question may 
become arbitrarily small as the size of the table increases. In other words, 
the event {T(Y) = r} would typically be rare. 

We shall formulate the problem of estimating P(T(Y) = r) as a sequen- 
tial rare-event simulation problem involving a suitably defined random walk. 
Define an m-dimensional random walk (rw) S = (S^ : < A; < n) as fol- 
lows. Given vectors of nonnegative integers c and r, set Sq = r and (for 
/c € {1, . . . ,n}) define = Sfc_i — where is a 0-1 entry vector (of 
dimension m) with uniform distribution over the space of configurations 
(xfe,i, . . . such that YJj=i ^k,j = Ck and (for I < j < m) Xkj G {0, 1}. 

The vector Xfc represent the kth column of the table. The random vectors 
(Xfc -.1 <k <n) are assumed to be independent. Finally, let us write -Pr.c(") 
for the probability law generated by the rw S subject to Sq = r and -E'r,c(") 
for the corresponding expectation operator. Note that -Pr,c(') is defined via 
a time inhomogeneous Markov chain; this is because the distributions of the 
Xfc's change in time according to c. 

Observe that 



u(r, c) ^ Pr,c(S„ = 0) = P(T(Y) = r). 
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As we shall see, the function n(-) can be used to describe the conditional 
distribution of the rw S given that Sn = 0. In turn, such description is highly 
relevant for the design of efficient importance sampling algorithms. 

2.2. Basic notions on importance sampling. We shall briefly discuss ba- 
sic concepts related to importance sampling and then apply these concepts to 
the random walk computation described in the previous section. For a more 
detailed discussion on importance sampling, see, for instance, Asmussen and 
Glynn (2007), Glynn and Iglehart (1989) and Liu (2001). 

Suppose that we want to estimate P{Z € ^) > 0, for a given random ob- 
ject taking values on a space X with a ci-field B. Let us define the probability 
measure Fz{dz) on {X,13) via Fz{dz) = P{Z € dz), so that 

PiZ£A)= I Fzidz). 

J A 

We say that a probability measure G{dz) on {X, B) is an admissible choice 
for an importance sampler or change- of -measure if Fz{dz)lA{z) is abso- 
lutely continuous with respect to G{dz) [note that Fz{dz)lA{z) is not nec- 
essarily a probability measure] . In other words, if G{dz) is admissible then 
G{B) = implies P{A n S) = 0, so that the Radon-Nikodym derivative 
L{z) = lA{z){dFz /dG){z) is well defined. If G{dz) is admissible, then 

(3) P{Z eA) = E^L{Z) = J L{z)G{dz). 

Here, we are using E^{-) to denote an expectation that is computed under 
the probability measure G(-) [similarly, we will use VarG'(-) for variances 
under G{-)]. The random variable L, which is clearly an unbiased estimator 
of P{Z S A), is an importance sampling estimator. In some discussions on 
importance sampling, the likelihood W{z) = {dFz /dG){z), when is well de- 
fined, is said to be the "importance sampling weight" [see, e.g., Liu (2001)]. 
When ^^(z) is well defined, then one can write L{z) = W{z)lAiz). 

The idea behind importance sampling is to take advantage of representa- 
tion (3) in order to estimate P{Z G j4). In particular, one can simulate k i.i.d. 
(independent and identically distributed) copies of Z using the distribution 
G(-) and output the estimator 

(4) wi'J = lj:L{Z,). 

By the LLNs and identity (3), is a consistent estimator of P{Z G A) 

as k y oo. Note that importance sampling can in principle achieve zero 
variance. Indeed, if one chooses as change-of-measure 
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we obtain, say if m = 1 and Z = 

'P{Z€dziy-^ 



Wis = L{zi)=P{ZGdzi] 
= P{Z£A). 



P{Z G A) 



So, our estimate of P{Z € ^1) is exact (in particular, it has zero variance). 
Obviously, such importance sampling estimator is not feasible to implement 
in practical cases because it requires knowledge of P{Z € A), which is the 
quantity of interest. However, the form of the zero- variance importance sam- 
pler indicates that a good change-of-measure should be similar to the con- 
ditional distribution of Z given that Z £ A. 

2.3. Zero-variance importance sampler for binary contingency tables. Note, 
conditioning on Xi, that u(r, c) satisfies 

u{r,c) =£'r,c('"(Si,Pi)), 

where pi = {c2, ■ ■ ■ , c^); note that the dimension of pi, which we shall denote 
by size{pi), equals n — 1. More generally, at time < A; < n — 1 

(5) u{sk, Pk) = Es^,p^{u{Sk+i, Pk+i)), 

where Pfc+i = (cfc+2, • • • ,c„) and size{pk) = n — k. If we denote the empty 
vector by the symbol *, we must have that u(0,*) = 1 and u(r, *) = for 
r /O. 

Let us define a Markov kernel Q*p^{-) (for < A; < n) via 

Note that (5) guarantees that Q*^ is a well defined Markov kernel [i.e., the 
probabilities (5*^(sfc,-) is a probability mass function]. In order to simplify 
the notation in what follows, we shall drop the explicit dependence on the 
subindex p^. If one could use P*^ (•) as importance sampler for simulation 
[i.e., simulate the process (S^ -.0 < k <n) according to transitions generated 
by Q*{-)], then our likelihood ratio estimator would be (given Sq = r and 
Po = c) 

HSn = 0) n -^^^ = = .(r, c), 

which has zero variance. Therefore, Q*{-) corresponds to the zero-variance 
importance sampling distribution. The kernel Q*{-) is the so-called Doob's 
/i-transform and describes the conditional distribution of the process S given 
that S„ = 0; see Doob (1957). 
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The efficient design of importance sampling algorithms should take ad- 
vantage of any available information about u{-). For instance, if we know 
that u(-) is in some sense close to some computable function v{-), then given 
our previous discussion, it is natural to consider a transition kernel of the 
form 



where w{s]^,pf^) is the appropriate normalizing constant that makes Q{-) a 
well defined Markov transition kernel. Once again, for notational simplicity, 
we suppress the explicit dependence of pf^ in Q{-) but keep in mind that 
Q{-) is a time inhomogeneous Markov kernel. This is the strategy that we 
shall pursue in the next section in order to describe an importance sampling 
scheme that can be rigorously shown to be efficient in a context of sparse 
tables. An early reference that explores the connection between /i-transforms 
and importance sampling is Glynn and Iglehart (1989) [see also Asmussen 
and Glynn (2007) and Juneja and Shahabuddin (2006) for more applications 
of this idea]. 

3. Approximating the optimal change-of-measure and algorithm design. 

In order to apply the strategy outlined at the end of the previous section, 
we need to find a suitable approximation v{-) to u{-). Results from McKay 
(1984) and Greenhill, McKay and Wang (2006) will allow us to obtain valu- 
able information on n(-) that we will exploit in order to design an efficient 
importance sampling algorithm. In order to develop the required approxi- 
mations, it is useful to introduce some notation. 

As we indicated at the beginning of the previous section, //(r, c) represents 
the number of tables with fixed column sums vector c and marginal row 
sums given by r. Note that with this definition of /i(r,c) we have n(r,c) = 
fi{r,c)/rj{c,m). We shall assume that the Cj's are ordered in a nonincreasing 
way so that ci > C2 > ■ ■ ■ > Cn- Having the cj's ordered in this way does not 
affect the asymptotic approximations that we are about to describe, but as 
we shall see, the ordering is important for the good performance of SIS. 

We now introduce some convenient notation as in Greenhill, McKay and 
Wang (2006) that will be useful throughout the rest of the paper. Given a 
number s and an integer k>0, we define [s]k = s(s — 1) • • • (s — /c + 1) and 
[s]o = 1. Given a vector s = (si, . . . , s„(,) of dimension no, we set [s]o = 1 and 
define, for any integer k>l, 




and also s! = si!s2! • • • s„„!. 
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Define 

I N Ml! [c]2[r]2 

We now are ready state the following result, the proof of which is given 
at the end of the section. The next theorem is basically an adaptation of 
results from McKay (1984) [see also Theorem 1.1 of Greenhill, McKay and 
Wang (2006)]. 

Theorem 1. Assume that max([c \y]^) = 0{d) and 
ri)=o{d^/^) asd/oo. Then 

/i(r, c) ~ ip{r, c) exp(-a(r, c)) 

as d ^ oo. 

The previous result is slightly different from that of McKay (1984) who re- 
quired maxj<m,j<n{?^i, Cj} = o{d^/^) but did not assume max([c^]i, [r^]i) = 
0{d). Further refinements have been given in Theorem 1.3 of Greenhill, 
McKay and Wang (2006) who introduce additional correction terms by as- 
summg maxj<mrj x maxj<„, Cj = 

Continuing in the spirit of our discussion at the end of the previous sec- 
tion. We are interested in proposing a function v{-) that mimics the behavior 
of u{-) in some sense in order to construct our importance sampling algo- 
rithm. Theorem 1 suggest using the approximation 

. A ^(r,c)exp(-a(r,c)) 



?7(c, m) 

Let us define f(0,*) = 1 and v{s,p) = if at least one component of s 
is negative. As we indicated at the end of last section, our discussion of 
the zero- variance change-of- measure, Q* , suggests designing the importance 
sampling distribution via a Markov transition kernel of the form 



(6) Q(sfc,Sfc+i) = 

where pp. = (c^+i, . . . , Cn) and 



m 



f(Sfc+l,Pfc+l) 



X — ^ / TTl 



Ck+lJ w{Sk,pk) 

1 ip{Sk+l, pk+l) 



(Sfe,Pfc)->{Sfc + l,Pfe+l) 



Cfe+1/ r]{pk+i,m) 



is the normalizing constant that makes Q{-) a well defined Markov transition 
kernel. In the previous display and in the discussion that follows, we use 
(sfci Pfc) (sfc+i) Pfc+i) to denote an admissible transition step [i.e., — s^+i 
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is an m-dimensional 0-1 whose components add up to Ck+i and p^+i = 

{Ck+2,---,Cn)]- 

We shall mention how to simulate transitions under Q{-) right after the 
precise description of the proposed algorithm below. We will use P^c{-) to 
denote the probability measure induced by the random walk S under the 
transition kernel given that Sq = r and Er,c{-) to denote the corresponding 
expectation operator associated to P^c{-)- 

Note that under the change-of-measure i-'^c(-) we may have P^c{Sn 7^ 
0) > 0. Therefore, when running an importance sampling algorithm based on 
transitions according to Q{-) we may obtain realizations for which {S„ 7^ 0}. 
A sufficient condition that implies {S„ 7^ 0} and which can be easily checked 
at a time A; < n is that the number of strictly positive components of is 
less than Ck+i - So, the path generation under Pr,c(-) will be done sequentially 
according to the transition kernel Q{-) until time n (in which case we have 
that the event {Sn = 0} has occurred) or up to the first time k such that 
the number of strictly positive components of is less than Ck+i (in which 
case we have that {S^ 7^ 0}). 

In order to explain this path generation scheme more formally, let us 
define 

$(Sfc) = card{j : Skj > 0}. 

That is, <&(Sfc) is the number of strictly positive components of S^. Put 
Cn+i = 1 and define a stopping time r via 

r = inf{0 <k <n: $(Sfc) < Ck+i}. 

Observe that when {r < n} occurs one of the components of the vector Sn 
must be negative and, therefore, {S„ 7^ 0}. On the other hand, if S,- = 0, we 
must have that r = n because the Cj's are strictly positive and X]j=i — 
Therefore, we have that {S„ = 0} = {S,- = 0}, and consequently 

7x(r,c)=Pr,c(S, = 0). 

The path generation scheme that we described before under the measure 
-P^c(-) will be done sequentially up to the stopping time r. Note that the kth. 
column, namely X^, is generated under -Pr^c(-) during the course of the path 
generation only if r > /c — 1. In turn, X^. is a binary vector such that the sum 
of its components equals and Pr^c{') avoids assigning negative components 
to the random walk S and, therefore, generation of increments under P^d') 
can be performed up to time r. If r < n, then the rth assignment under 
-P^c(-) cannot be done and the estimator is just zero. If r = n, then the 
table is constructed satisfying the row and column sums. We then conclude 
that Pr^d') is admissible in the sense that it does not assign zero mass to 
outcomes that are possible under -Pr,c(') and for which S„ = 0. In fact, it 
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turns out that the sequential importance samphng algorithm generated by 
Q{-) coincides with one of the procedures studied by Chen et al. (2005). In 
order to see this, note that 



+1) 



oc 



\Ck+lJ w{Sk,Pk) v{Sk,Pk) 

visk+i,Pk+i) _ ^{^k+i, Pk+i)r]{pk^rn) 



v{sk,pk) ip{sk,pk)viPk+i,m) 
V^( sfc+i,Pfc+i) ^ _Sfc! 
•fi^kiPk) Sfc+i! 

oc \Y (sfcjexp(27fcSfcj)), 

jGli: Sfe+l,j7^Sfc,j} 

where 7^ = YJjZ2{Pk,j- Pk,j) / {'^{d- Pk,i)) (note that card{i : Sk+i,j + ^k,j) = 
Pk,i)- The proportionality relations are introduced to emphasize the de- 
pendence of the transition kernel only on the k + 1th increment, namely 
Sk+i — Sfc. The last line of the previous display coincides with the descrip- 
tion given in page 112 of Chen et al. (2005) [the complete details of the 
computation corresponding to the difference a(sfc,p^) — a{sk+\-, Pk+i) ^'^^ 
given in Section 5; see (17)]. 

The precise form of the algorithm that we analyze, based on the transition 
kernel Q(-) defined in (6), is given next. 

Algorithm 1. 

Step 1. Order the Cj's so that ci > • • • > c„ and set s < — r, p < — c, 
L < — 1 and I < — 0. 

Step 2. Let A = {i : Si > 0} and define = card(>t). Put ci < — Pi, 
p < — {p2, ■ ■ ■ ,Psize{p)) and / < — I + 1. If m_A < ci, put L = and GO TO 
Step 3. Otherwise, if ^ < n, then evaluate 

7^[p]2/2[/9]? 

else, if / = n set 7 = 0. Sample (1^^ , . . . , ) according to the distribution 

(7) P(y,, = y., , . . . , Yi„^^ = Vi^^ ) = ^ n exp(275, ))''^' , 

i=i 

where J2Y=i Vij = ci, Vij G {0, 1} and 

{(2/n v,yi™ , ): J/ii+---+2/im . =ci} i = l 
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Then update 



L< 



and for j A put < — Sj — Yj . 

Step 3. If L = orn = output L and STOP, otherwise, GO TO Step 

2. 

The output of Algorithm 1 is given by 

Equation (7) describes the distribution of Xj+i given Sj under Q{sj, •). Sam- 
phng according to such distribution can be done, adopting the terminol- 
ogy used by Chen et al. (2005), using the so-called "drafting method" for 
Conditional-Poisson (CP) distributions described by (7). 

For completeness, we shall review the basic properties of the drafting 
procedure; our discussion follows [Chen et al. (2005) and Chen, Dempster 
and Liu (1994)]. Given a distribution of the form 

(8) P{Zi = Zi,...,Zm = Zm) = 

where wj > and Zj € {0, 1} for 1 < j < m, the drafting method allows to 
both, sampling the Zj's and efficiently computing the normalizing constant 
w. The drafting method is a sequential procedure that allows to sample c 
units without replacement from the set Am = {1,2,..., m}; the ith unit has a 
probability proportional to Wi. Let Af^, < k < che the set of selected units 
after k draws, so that = and Ac is the final sample to be obtained. At 
the kth. step (with 1 < k < c), a unit j € is selected into the sample 

with probability 




p(j,^: 



^ wic-k,Al_-^\{j})wj 
{c-k + l)w{c-k + l,Al_^)' 



where 



CCA,cavd{C)=i \ieC ) 

iZ;(0, A) = \ for all A C A„i and wii^ A) = for i > card(yl). The computation 
of the i/;(i,^)'s is performed using the recursion 

w{i, A) = w{i, A \ {j}) +w{i-l,A\ {j})wj . 
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For instance, to compute w(c, ^4^), we apply the recursion 

w{i,Aj) = w{i,Aj \ {j}) +w{i - l,Aj \ {j})wj 

for 1 <i <c and i < j <m. It follows that computing w{c, Am.) takes 0{cm) 
operations. Evaluating p{j,Am) =p(i, Aq) then takes 0{cm?) operations. 
Each of the ^^)'s can be evaluated similarly, however, it is more con- 
venient to use Lemma 1 of Chen, Dempster and Liu (1994), which states 
that 

^c-k)iw.,-wM^k,Al_,) 

for 1 < k < c — 1 and j G where is the element selected in the kth. iter- 
ation of the drafting procedure. Therefore, we conclude that it takes 0{cm?') 
operations to generate a sample from (8) using the drafting method. Chen 
and Liu (1997) discuss four additional sampling procedures for CP distribu- 
tions with similar complexity properties. The previous considerations imply 
that the computational cost per replication of an importance sampling al- 
gorithm based on Q{-) is of order 0{m'^d + n\og{n)). The contribution of 
the term nlog(n) corresponds to ordering the Cj's in Step 1 and computing 
7 in Step 2. Note that subsequent updates of 7 can be done recursively so 
there is no need to add an extra factor from the fact that the algorithm goes 
through Step 2 n times. We summarize these observations in the following 
lemma. 

Lemma 1. Algorithm 1 requires 0{m?'d + n\og{n)) operations to be ex- 
ecuted. 



Chen et al. (2005) also proposed a more refined importance sampling 
procedure which can be explained as follows. Note that we constructed our 
importance sampling transition kernel, Q{-), via a suitable approximation 
w(r,c) of ti(r,c) that is valid as d y 00. Furthermore, we introduced addi- 
tional information into f(-) by defining v{s,p) = if s contains at least one 
negative component. Intuitively, we could have done even better by setting 
v{s,p) = whenever u{s,p) = 0, this is the idea behind the refinement pro- 
posed by Chen et al. (2005). One immediate difficulty here is the question 
of how to easily test the pairs (s, p) for which u{s, p) = 0. This is achieved 
by making use of a characterization of so-called graphical sequences (i.e., 
degree sequences that can give rise to a bipartite graph) in terms of certain 
constraints that can be easily checked during the course of the simulation. 
Introducing these types of constraints on the support of Q together with 
asymptotic approximations may help produce efficient importance sampling 
estimators (in terms of the discussion given in Section 4). However, in our 
current context, the vanilla version of the importance sampling procedure. 
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indicated in Algorithm 1, will already be proved to be efficient in a precise 
mathematical sense to be described in the next section. 

Proof of Theorem 1. We follow closely the steps in the proof of 
Lemma 2.2 and Theorem 1.3 in Greenhih, McKay and Wang (2006) (GMW). 
First, we introduce their counting model. We consider a set of d labeled 
points arranged on m cells, say Qi, ■ ■ ■ , Qm- The cell Qi contains elements. 
Similarly, we consider another set of d labeled points arranged in n cells 
denoted by Hi, . . . , H„ and assume that the jth cell, Ej, contains Cj elements. 
We then have 2d labeled points in total. A partition of the 2d elements into d 
unordered pairs is called a pairing. Each pair is denoted by e = {p, where 
p ^ Qi for some 1 < i < m and ^ G Hj for some 1 <:j <n. We also write v{p) 
to denote the cell corresponding to the point p and similarly v{^) to denote 
the cell corresponding to the point ^. A random pairing is a pairing that is 
chosen uniformly at random out of the d\ possible pairings. Two pairs are 
called parallel if they involve the same cells. An error is an unordered set of 
two parallel pairs. 

It follows easily that the probability of obtaining / > given pairs occur- 
ring in a random pairing is l/[d]z. Let pa be the probability that no errors 
occur in a random pairing. As noted by GMW, we have 

/i(r,d)r!c! =d\pd, 

because (up to a permutation in the labels of the elements in each of 
the cells) each contingency table corresponds to a pairing that has no er- 
rors. Therefore, it suffices to estimate pa which is done, once again fol- 
lowing GMW, using inclusion-exclusion and Bonferroni's inequalities. The 
inclusion-exclusion development is applied as follows. First, given two dif- 
ferent pairs e and e' define B{e;e') to be the set of pairs that contain the 
particular error {e, e'}. Note that pd = l —Pdi where p^ is the probability 
that at least one error occurs in a random pairing. In turn, p^ is less or 
equal to the total contribution corresponding to placements with one error, 

which we denote by h^P ■ In particular, = Z]{e,e'} -^(-^(^5 ^O)- More gen- 

—(k) 

erally, let us define as the total contribution in the inclusion-exclusion 
development corresponding to pairings that contain k errors or more. So, for 

—(2) 

instance, = S{{e,e'},{e,e'}} -f (-^(c, e'), S(e, e')), the sum runs over sets of 
two different errors. We then have that for k>l 

(9) T>'^ + - + 6f "^^ - -bf^ < Vd < 6? - 6f + ■ • • + 4"-^^ • 
— (fc) 

Note that by,k>2, can be divided in two parts, namely, one that contains 

(k) 

errors that do not have a pair in common, which we denote by /S^g , 
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another part that contains errors that have pairs in common, which we 
denote by pj^^l . We define /3^^q = b^J'' and note that 

by = [Y. q(c, -1)Y: rj{r, - 1) j /2 = a(r, c) + o(l) 

as d y oo. We claim that for each k>2 

(10) = 

as d /" oo. To see this, let us first define Af^ as the set of ordered 2A;-tuples of 
pairs (ei, 62, . . . , e^, e'^, . . . , e'^,) [with e^- = (p^, ^j) and e'j = {p'j,^j)] satisfying 
for each / G {1, . . . , k}, ii = i[ and ji = j[ where 

v{pi)=ii, v{p'i)=i'i, 



with ii / and j/ / if / / s. 
Note that 



Pd,0 



2''k\ [d]2k' 

We claim that 

(11) |A6c+i| = |A4|(^(^E[q]2EN2) +0{d')^. 

To verify this claim let us pick an arbitrary element (ei, 62, . . . , e^, e'^, . . . , e';^) G 
A/fc- We obtain an element of A4+i by adding two parallel pairs (e^+i, e'f.^^) 
so that we obtain k + 1 errors that do not have pairs in common. This is 
achieved in 

(n k \ / ™ ^ \ 

1=1 1=1 / \j=l 1=1 / 

many ways. Now, since maxj<„Cj = o((i^/^) we have that 

S|LiK]2 ^^/ max,<^cj \ 
d ~ \ d J 
as d /' 00 (a completely analogous estimate also applied to the sum involving 

(k) 

the Tj, 's). This implies (11) and as a consequence (10). To study it 
suffices to perform a very rough analysis. Indeed, note that 



1=2 
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where the Ith term in the previous sum corresponds to cohections of k errors 
in which all the pairs belong to I errors or less. Now, we have that 



En l+l v^m , 
1=1 ^j=l 



J+l^m l+l n m II 

< max -TTiK X max -^ttt > > ■ 



i<n j<m dM'^ S 

1=1 0=1 



Since / > 2 we have that Ya=i J2f=i Ci'fj/d'- = 0(1) as d / oo and we con- 
clude that = o(l) as d oo. Therefore, 

a(r, c)'^ 



"d - Pd,0 + Pd,l 



+ o{l) 



as d y oo. In order to complete the argument, recall that for each c G (0, oo) 



lim sup 

>Oo<a;<( 



exp(x)5:(-l)^-^-l 



j=0 



J! 



0. 



Under our current assumptions, we have that Q(r,c) = 0(1), therefore, the 
previous estimate for the exponential function together with (9) yields the 
conclusion of the result. □ 



4. Complexity notions in rare-event simulation. In these section, we 
shall briefly discuss basic notions of efficiency that are helpful to calculate 
the computational cost (in terms of the number of replications) of estimat- 
ing small probabilities via simulation using an estimator of the form (4); 
for more on efficiency of rare-event simulation estimators [see Asmussen and 
Glynn (2007), Bucklew (2004) and Juneja and Shahabuddin (2006)]. 

Let /3 = P{Z € A) and suppose that /3 « 0. In order to be precise, we 
shall introduce a parameter d such that f3d = P{Zci £ A) — > as d /" oo and 
perform our cost analysis under this asymptotic regime. 

Our goal is to produce an estimator, Pfi,ky with the property that for 
given e,S £ (0, 1), \Pd,k — Pd\ ^ f^d^ with probability {1 — 6). If /3d,fc has this 
property, we say that (3(i,k has e-relative precision with 1 — 6 confidence. Here, 
we use the subindex k to denote the number of i.i.d. replications required 
to produce (3d^k- Let {Ldj :j > 1) be i.i.d. r.v.'s such that ELdj = (3d and 
consider the unbiased estimator 

1 ^' 

f^d,k = ^J2^'id■ 
A standard way to measure the efficiency of the estimator Pd,k in the rare- 
event simulation literature relates to its variance measured in relative terms. 
This approach gives rise to the notion of strong efficiency. More precisely, 
if (7^ = Var(Lrf) < oo, then the L^'s are said to be strongly efficient if the 



18 



J. H. BLANCHET 



corresponding coefficient of variation, cvd — crd/Pdi is uniformly bounded 
for d>0. In particular, in the context of importance sampling estimators 
discussed in Section 2.2, see (4), L^j = L(ij{Zcij) and = YavG{Ld). In 
other words, the variance must be computed according to the underlying 
importance sampling distribution. 

One often says that is strongly efficient meaning that the family of L^'s 
is strongly efficient. In order to motivate strong efficiency in terms of the 
computational cost (measured by the number of i.i.d. replications) required 
to produce an estimator that has e-relative precision with 1 — 6 confidence, 
one can use Chebyshev's inequality to obtain 

2 

Therefore, k > £~'^5~'^{ad/ (id)'^ replications are sufficient to produce an esti- 
mator that achieves e relative precision with 1 — 5 confidence. Consequently, 
if Ld is strongly efficient, the number of replications required to obtain e- 
relative precision with 1 — 5 confidence is bounded as jSd — > 0. Obviously, 
strong efficiency alone is not a useful concept for measuring computational 
complexity because nothing has been said about the computational cost 
attached to each replication. 

When dealing with discrete structures, such as binary contingency tables, 
it makes sense to measure the cost per replication in terms of the amount 
of information (number of bits) required to encode the family of problems 
at hand (i.e., the size of the problem). In the context of binary contingency 
tables, statistical applications such as those described by Chen et al. (2005), 
require estimating the whole distribution of statistics that depend on all the 
entries in the table in order to perform an hypothesis test. As a consequence, 
it makes sense to parameterize the size of the problem, say d, in terms of 
the number of bits required to encode a binary table, which can be taken to 
be the number of ones (or the number zeros, but if the table is sparse, it is 
obviously cheaper to encode it in terms of the number of ones). 

The total complexity involves multiplying the number of replications, k, 
times the cost attached to the generation of each replication which we shall 
denote by K,(d) (the cost per replication is measured by the total number 
of operations such as additions, multiplications and comparisons in terms 
of the size of the problem). Therefore, in the presence of strong efficiency, 

by setting the number of replications k = @{e~'^5~^), we see that Pd,k re- 
quires 0(K((i)e"^(5~^) operations as d y oo and \ to achieve e-relative 
precision with 1 — 6 confidence. 

The notions of efficiency discussed in the previous paragraph are related 
to standard notions found in randomized algorithms and approximate count- 
ing, such as that of fully polynomial randomized approximation schemes 
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(FPRAS) [see, Mitzenmacher (2005), page 254]. In particular, an algorithm 
that outputs and estimator that has e-relative precision with 1 — 5 confi- 
dence in 0{K{d)e~^^ x \og{6~^)^^) operations, for some ki, A;2 > 0, as d oo 
and e,5 \ is a FPRAS if K{d) grows polynomially in the size of the prob- 
lem, say, d. Because of the factor log{6~^)^^ that appears in the definition 
of a FPRAS, which is much smaller than the factor that arises in the 
context of strong efficiently, we introduce a stronger form of efficiency that 
we shall call exponential efficiency. 

Definition. We say that the family of estimators {L(i:d> 1) is expo- 
nentially efficient for estimating (3d if there exists > such that 

(12) ^(0)^suplog^exp(^Lrf//3d)<oo. 

d>l 

Remark. In the context of importance sampling estimators introduced 
in Section 2.2, the expectation in (12) is taken with respect to the underlying 
importance sampling distribution. 

The next lemma, which is a uniform version of Chernoff 's bound, will be 
useful to relate an estimator of the form ^ to a FPRAS. 

Lemma 2. Suppose that the family of estimators (L^ : d > 1) is exponen- 
tially efficient for estimating (3d, then for e > we have 

(13) P{\(3d,k-Pd\>sPd)<2exp{-kmm{I{e),I{-e)), 

where I{h) = supe(6'(l + h) - i){9)). Moreover, /(e), /(-e) > and I{h) > 
ph? for some p> 0. 

Proof. Just as in the proof of Chernoff 's bound, (13) follows by an ap- 
plication of Chebyshev's inequality. Let ipdiG) =^ogE exp{6Ld/ Pd) we then 
obtain 

Pihk -I3d> e(3d) < exp f-fcsup(^(l + e)- M0))) 

= exp(^-k sup{e{l + e) - ipdiO))^ < exp(-A:/(e)). 

Similarly, one obtains 

PiPd - hk > ePd) < exp{-kli-e)). 

Inequality (13) is obtained by adding up the left- and right-hand sides of 
the previous displays after simple manipulations. The last part of the lemma 
follows from the convexity of ■0(') (supremum of convex functions is convex) 
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combined with the fact that ipd{0) - 6 = cv\Q'^ jl + 0(6'^) as 6* \ uniformly 
over d (which holds by Taylor's theorem and exponential efficiency) and 
the bound sup(^>2 cv\ < oo (which once again follows from exponential effi- 
ciency) . □ 

One way to verify exponential efficiency is by showing that is bounded 
above by some deterministic constant, say c^, such that = 0{j3d) as d y 
oo. An immediate consequence of the previous result is that if the family 
(Ld - d > 1) is exponentially efficient and K{d) operations are required to 
generate a single replication of Ld- Then by setting k = 0(e~^ log((5~^)), we 
see that Pd,k requires 0{K,{d)e~'^ log{6~^)) operations to achieve e-relative 
precision with 1 — 5 confidence. If K{d) grows polynomially in the size of the 
problem d, then (3d,k is the output of a FPRAS. 

In addition to considerations related to the way in which the coverage 
parameter 5 enters the complexity analysis [in the form for strongly 
efficient estimators and log((5"^) in the context of exponential efficiency], 
exponential efficiency guarantees robustness properties that are desirable in 
practice when constructing confidence intervals via the central limit theorem 
[see the discussion in L'Ecuyer et al. (2008)]. 

5. Complexity analysis. This section is dedicated to the proof of the 
following theorem which is our main result. 

Theorem 2. Suppose that maxi<rn'''i = 0(1), maxj<nCj = o{d^^'^) and 
that [c^]i =0{d) as d y oo: 

(i) Then the estimator L provided by Algorithm 1 is strongly efficient 
as d y oo. Since according to Lemma 1, each replication of L requires 0{d'^) 
operations, the computational complexity required to estimate u{r,c) with 
e-relative precision and (1 — 5) confidence is of order 0{e'^6~^d^) as d y oo 
and e, 6\0. 

(ii) Moreover, if in addition we have that maxcj = o{d^/'^~^") for some 
Sq> as d y oo, then the estimator L provided by Algorithm 1 is exponen- 
tially efficient as d y oo. Consequently, Lemma 1 implies that 0(e~^ log{6~^)d^) 
operations are required to estimate u{r,c) with e relative precision and (1 — 5) 
confidence as d y oo and e, 5 \ 0. 

The following basic result (whose proof is given at the end of this section) 
will be very useful in the analysis of the likelihood ratio produced by our 
importance sampler. 

Lemma 3. Let {xj :j > 1} be a sequence of positive integers and let us 
write {xi^n : 1 < i < n} to denote any nonincreasing arrangement of the set 
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{xi : 1 < i < n} so that 

Define y^^^ = Ej=fc+i ^j,n '^^d yf^^ = Ej=fc+i ^i,n /or < A: < n - 1. Then: 

(i) (2) (2) (2) 

^fc+l,n , ^fc,n ^ ^0,n 

(1) - (1) - (1) ■ 
yk+l,n yk,n Vo^n 

(ii) // vl^n/yQ^n = 0(1) as n /' oo, then there exists a constant a > 
( independent of n and k ) such that 

(14) y^l<ain-k) 

as n y Qo. Moreover, if xi^n = o{n^"~^") for < (^o < /^o !^ 1/2 then we also 
have that 

(15) ^< 



,,(1) - l + (n-j)^-A)+'5o■ 
(iii) Under the assumptions of part (ii), if 5q>Q, then 

(16) supy^ < cx). 

">1 j=l yj-i,n 

The previous result will be applied repeatedly to the sequence of c^'s 
which is assumed to be ordered in a nonincreasing way, namely, ci > C2 > 
■ ■ ■ >Cn- So, for instance, assuming that [c^] = 0{d), then given Pq = c and 

Pfc = (Pfc,li • ■ ■ Pk,size{pk)) = (Cfc+1, ■•■,£«) 

for j < n — 1, (15) implies that there exists no such that for all k <n — uq 
we have that then Pk,i/[Pk]i — 1/2- Similar implications are immediate from 
Lemma 3 and will be invoked in our future discussion. 

We now proceed with the development behind Theorem 2, we first start 
with part (ii). By running Algorithm 1, we obtain the estimator 

^=o^'(Sfc+l,Pfc+l) 

_ ^(r.c) ""y^ w{Sk,Pk) 
-.(0,*)11 ,(s,,p,y(S"-0)' 
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where (as indicated before) v{0,*) is defined as 1. Recall that 

v{r, c) n(r, c) 

as d y oo. Therefore, in order to show exponential or strong efficiency, we 
must study the properties of Rd defined as 

i?,((So, Po),..., (S„_i, p„_i)) ^ n 4|^nSn = 0), 

given Sq = r and Pq = c. The analysis of involves studying the ratio 
w{so,po)/v{so,Po) 

w{so,Pq) ^ / m v{si,Pi) 

v{So,Po) ~\PO,lJ (r,cH(s,p)^(^0,Po)' 

Note that 

v{si,Pi) ^ y(si,Pi)7?(Po,m) 
v{so,Po) v{so,Po)r]{pi,m) 

and (using the notation pij to denote the jth component of the vector Pj 
for i G {0, 1} and recalling that pi^ = po,j+i) 

ri{pQ,m) _ ( m \ / m \ / / m \ ( \\ ^^ _ ( 
viPi,m) VPo,i/ \PO,nJ VVPo,2/ \Po,nJJ VPo,i. 

Next, observe that si is obtained by selecting a set T = {ii, . . . , ip^ ^ } of 
(ordered) subindices and by picking the ith component of the vector si, 
namely si^j, via si^j = so,i — 1 (i G T). Consequently, we have 

¥^(si,Pi) ^ / 4 
'/'(scPo) VPo,!, 
where do = J^JLi sqj = Z]j=i Po,j [with no = size(po)]. Therefore, 

^$^^=fn^°) E (nigrso,i)exp(-(Q(si,pi) -a(so,Po))). 

Let us provide a more convenient expression for the previous ratio. First, 
we write s^, ^ = {si^,. . -,81^^ J and define 7 = [pi]2/{2[pi]l). Then (using 1 
to denote the vector of ones) we have 

a(si,Pi) =7[si]2 = 7([so]2 - [2(so,r - l)]i), 



(Iljerso.j) exp(-(a(si, pi) - a(so, Po))), 



[Poh [PoJi 2[po]j,' 



2 
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27po,i[so]2 , 7Po,i[so]2 [po,i]2[so]2 

= 7 So 2 } ] \ 7 T2 \ Vl • 

[Po]i [poll 2 [Poll 

Therefore, we have 

- a(so,Po) 

(17) 

-,M , 27/)o,i[so]2 7/'o,i[so]2 [po.i]2[so]2 
= -7[2(so,r - Ijji + 



[Poll [Poll 2[po]f 

Define /3(so,Po) ^'^^ 

, a, \ 27po,i[so]2 , 7^0,1 N2 [po,i]2[so]2 

and 

/i(so,r,so,i) =n^=U'5o,ij exp(27(so,j, - I)))- 
We now are ready to provide an estimate for the ratio w{sq,Pq)/v{sq,Pq). 

Lemma 4. Assuming that maxj<m?'i = and that [c^]i = 0{d) as 

d y 00 there exists a constant A G (0, 00) such that 

"^('°'^°)<exp('A[^°'^]^ 



v{so,Po) \ [poll/ 

Proof. Let us define /3o,i i.i.d. random variables Ji, J2, ■ ■ ■ , Jp^ ^ with 
distribution 



exp(27Soj)soj 



til 

where w = Z^^i 6xp(27Soj)'Soj and m = size{sQ). In addition, we shall use 
E{-) to denote the expectation operator associated with P(-) and define the 
event A = {Ji ^ Jj-.i / j} (i.e., all the Jj's are different). We have 

Let us first analyze w. Note that under our assumptions 

m , „2 

W ■ 



E5o,,|l + 275o,, + (27)'^ + ---j 
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We then conclude that 

Po,i / po,i! 



Po,i / po.i! V [so]i V [so]i 



k=l 

Assuming po,i/[so]i < V^j we have 
po,i-i 



fc=i 

and, therefore, we have 

[so] A"' ^5^°'^ 



[so] J - 2[so]i 6 [so]? 



log 



Po,i J po,i! 
^ 27Po,i[so]i _^ [po,ih _^ Po,i(Po,i + l)(2po,i + 1) 



+ 



[sd]i 2[so]i 6[so]f 
7^[so]iPo,r 



[so]] 



We now estimate P{A'^) using the inclusion-exclusion principle and Bon- 
ferroni's inequalities. We have that 

~ / \ 1 



this corresponds to the union bound taking all possible ways in which { J^^ = 
Jjj} for ii y^i2- Next, we obtain the following lower bound corresponding to 
the cases in which {Jj^ = Jjj, Jjg = Ji^}, 



2 



-j^ m 



1 y \ 3 



^II4jexp(127So,i) 



4\ ( PQ,i\ 1 



2 / V 4 



2 



~4(E4iexp(47Soj 
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We then conclude 

~ / \ 1 



+ 



2) (7 



I / m 

~4 E4j«xp(47S0,: 



\j=l 



Now, we have that 



Po,i 
2 



Po,i\ [«o]i , (l + 47[sg]i/[sg]i + (47)2[s^]i/(2![s2]0 + ---) 

X 



2 ; [soil (l + 27[sg]i/[so]i + (27)2[s3],/(2![so]i) + ---) 



2 ; [soliV '[s§]i V' Hh 

xri-27Mii+er7^[^^]^ 



\ [so]i V [so]i. 



Note that 

[so]i [so]i 
_2[si]]i[so]] 



s, 
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oil 



[so]i[sg]i 

[s^]l _^ ^'Ei<jiSo,i^O,j + SO.iSoj - ^O.i'^Oj) 



[so]i[s5]i [so]i[so]i 

"0,i 



[so]i[so]i [so]i[s, 

^ [So]l , [P0,l]2 

— r T r On "T 



2] 

oil 



[so]i[s5]i [so]i[s5]i' 
As a consequence, 



Po,i 
2 



^E4jexp(47So,: 



w . 
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- I 2 ; [so]? [so]i[s2]i + [so]i[s2]i [soil 



Therefore, 



+ (9( [PO'iJi _^ [P0,l]2 



[so]f [so]? 

P0,1^ [So]l I Qf WlU I [P0,l]2 _^ [/'0,l]2 



[so]f V [so]f [so]f [so]f 



[so]f V [so]f 

We now group all of our terms together in a convenient way in order to 
estimate the ratio w{sq, Pq)/v{sq, Pq). In order to do this we define the terms 
Xi, X2 and xs as 

27Po.i[so]2 , 27po,i[sg]i 
Xi = F-n + FTl 27po,i, 

f P0,1 \ H]l , Kl]2[so]2 ^ [P0,l]2 



2 ; [so]f 2[po]2 2[so]i ' 

Po,i(Po,i + l)(2po,i + l) , 7Po.i[so]2 „/[po.l]4 , 7^[So]l/00,l 

X3 = 77-12 + r , 12 +0[ ^-Tr + 



6[so]? [Po]f V [so]f [so]i 

Note that X3 = 0{[poA/[so]l), consequently if po,i/[Po]i < 1/2, then 

logf^4^)=Xi + X2 + 0^[^°^^]^ 



v{so,Po) J V [soil / ' 

Finally, we compute xi and X2', first, we have that 

27Po,i[so]2 , 27po,i[sg]i 27po,i 21 ^ 

f—l ^ [71 = r 1 (-[S0J2 + [Soil, 

Pol So 1 So 1 



27/00,1 r 1 „ 

— r^[So]l = 27po,i. 
so 1 
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Therefore, we have that xi = 0. A similar computation yields X2 = 0, indeed 

'PO,l\ [So]l , [/50,l]2[so]2 , [P0,l]2 _ [P0,l]2 ^ rOi , i N , [PO,lh 



2 ; [so]? + 2[po]? + ^ " + + 2[so] 



2[so]f ^ 2[so] 
The result of the lemma then follows. □ 

A consequence of the previous result is the following corollary. 

Corollary 3. Assuming that maxj<m'"i = 0(1), [c^]i = 0{d) and 
maxj<„Cj = o{d}^^~^^'') as d oo, then there exists a constant X* G (0,oo) 
independent of d such that 

Rd{iSo,po), . . . , {Sn-l,Pn-l)) < A*. 

Proof. Iterating the estimate obtained in Lemma 4, we obtain that 



Rd{{So,Po),---,iSn-l,Pn-l)) <Koexp[ 

\ k=o ^Pkli 



^ [Pk,i\i 



The result then follows as a consequence of (16) in Lemma 3. □ 

With Corollary 3 at hand, we have all what is needed to establish ex- 
ponential efficiency. However, before we put all the pieces together let us 
continue with the basic elements behind the strong efficiency properties in- 
dicated in part (i) of Theorem 2. We then will conclude with a summary of 
all our results and the complete proof of Theorem 2. 

In order to establish strong efficiency we must study the function 

.(so, Po) ^ i^s« po (Rl) = Eg,,, ill S|f;^^(S" = • 



\k=0 



In particular, we must show that 5(so,Po) remains bounded as [so]i oo. 
Our strategy is to derive a linear inequality for g{-) and show that one can 
satisfy this inequality with a convenient Lyapunov function / (•) that remains 
bounded as d /' oo. The next result provides sufficient conditions for the 
construction of an appropriate Lyapunov function. The corresponding proof 
is given at the end of the section. 

Proposition 1. Assume f >1 is a function that satisfies 
(19) ^^^°'^o)>^?g;^i^?„p.,/(Si,Pi) 
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as long as [so]i > do [for some do G (0,oo) fixed]. Then 

where = sup[sj,]^<rf^ 5(so, Po) < oo. 

A function /(•) satisfying the hypothesis of Proposition 1 is typically 
called a Lyapunov function in the context of stability of Markov chains 
[see Meyn and Tweedie (1993)]. Our goal is to build a bounded Lyapunov 
function /(•). Similar Lyapunov-type bounds have been studied in the rare- 
event simulation literature, see for instance Blanchet and Glynn (2008)] for 
applications in the context of rare-event estimation problems related to first 
passage time probabilities. 

Constructing an appropriate Lyapunov function /(•) is typically not a 
simple task. Nevertheless, such construction is often guided by a solid un- 
derstanding of the estimates involved in the ratio of w{sq, Pq)/v{sq, Pq) . This 
is precisely the strategy that we use to construction our Lyapunov function. 
In particular, we first put for 6o> 



/i(s,p) =exp(6'ia(s,p)) 

for some > and finally define 

/(s,p) = /o(p)/i(s,p). 

The form of this function was obtained by inspecting carefully the analysis 
behind Lemma 4. We first tried a Lyapunov function such as /i and then 
after doing some computations, recognized the need for a term such as /o 
as the proof of the next lemma indicates. 

Lemma 5. There exists 61,62 > such that /(•) satisfies the conditions 
of Proposition 1 . 

Proof. The proof proceeds along the same lines as that of Lemma 
4. Given (so,Po) us denote {si,pi) an admissible transition step [so 
that (so,Po) ~^ (^IjPi)]- particular, we have that there exists a set of 
subindexes P = . . . , ^ } such that sij = sqj — l(j € P). We write 7 = 
[Pi]2/(2[pi]i) and introduce po,i i-i-d. random variables Ji, . . . , Jp^ with 
distribution 




then set 



P{Ji = k) 



exp(27Sfc)sfc 



w 



IMPORTANCE SAMPLING FOR COUNTING 29 

where 



1=1 



We have that 

/i(si,Pi) 
/i(so,Po) 



exp{ei{a{si,pi) - a(so,Po))) 

■ exp 01 — ^ h 26'i7po,i - Oi — ' ^ 

V [poJi [poll / 

/ a [PO,l]2[so]2 r 1 

xexp -01 ' ^ 2(9i7[so,r]i 



where so,r = (soju ■ • ■ i^ojp^ We need to show that there exists do, 9i and 
62 such that for [s]i > do we have 

^0,1 



exp $■ 



^ w(so,Po)^ ^Q / /i(Si,Pi) 
- ?;(so,Po)2 ''''V/i(so,Po) 

_ ^"(so,Po)^ / m /i(si,Pi) t^(si,pi) 

^(so,Po)^ {so,PoMsi,Pi) /i(so,Po)^«(so,Po) 

^ w(so,Po) ^ / m \ /i(si,p^) ^(si,pi) 

v{so,Po) V^O'i/ /i(so,Po) ^^(so,Po)' 

As in Lemma 4, we have that 

f in \~'^ fiisi,pi)v{si,p^) 

(so,Po)^(si.Pl) 



VP0,1/ /i(so,Po) 1^(80, Po) 



w^'o. exp ((^1 - 1)^^^^^ + 2(^1 - l)7P0,i) 



/[so]i^j \~Po.i^^^f^a. -,^27po,i[so]i 
V Po,i 

fa in7Po,i[so]2 , , Jpo,l]2[so]2 

X exp -{61 - 1) ' (6*1-1) 



x^|^exp|^Xi-2^^^o,J,);Aj, 

where ^4 is the event that consists that ah the Jj's are distinct. During the 
proof of Lemma 4, we obtained that if po,i/[so]i ^1/2 then 
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^ 27/^0,1 [So]l ^ [P0,l]2 ^^f [P0,l]4 



[so]i [so]i V [poll 

Now, evidently we have that 



-20i7so,j, ];A]< {E{exp{-29irso,jW'' 



< exp(-26'i7/)o,i^so,Ji + O(6'i/Oo,i7^))- 



Therefore, combining all these estimates together with Lemma 4 we have 
that there exists a constant A > such that 

»Kso,Po)^ ^Q / /i(Si,Pi) 
w(so,Po)^ ''''V/i(so,Po) 

V [poll [so]i [so]i 
(20a) X expf {6, - i)^I^5MM + 2(6, - l)7Po.i 



{-(^1-1) 



7 Po,l[so]2 Jpo,l]2[So]2 

[Po]? ^ ' ^ 2[po]f 



X exp -{01 - 1) , - (^1 - 1 



/ 2gi7Po,i[sg] , 2.\ 
xexp(^ [g^j^ +O('9ipo,i7 )j- 

Note that in the last line of the previous display we have used the fact that 



21 



Now we note (just as we did in Lemma 4) that 

27Po,i[so]2 , 27po,i[sg]i „ „ 
f— 1 \ f-i 27/)o,i = 0, 

which implies that the logarithm of the right-hand side of (20a) equals 

JP0,l]4 , Kl]2 ,, 7Po,l[so]2 

2[PoJi 

It is immediate from the previous expression that one can select first > 
and then 62 depending on 61 so that the previous quantity is less or equal 
to ^2/Oo,i/[Po]i long as [so]i > do so that po,i/[so]i < 1/2- The conclusion 
of the lemma then follows. □ 
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It is time to summarize all the previous estimates and to complete the 
proof of Theorem 2. 

Proof of Theorem 2. We first establish part (ii). By virtue of The- 
orem 1 and Corollary 3, we have that 



it(r,c) ii(r,c) u(r,c) 

as d oo. Therefore, because of our observations in Section 4, is ex- 
ponentially efficient and part (ii) follows. Part (i) is established similarly 
thanks to Lemma 5. Note that 

= / k9{r,c) < -) 'r^f{r,c). 

Theorem 1 guarantees that f (r, c)^/u(r, c)^ — > 1 as d y oo. On the other 
hand, Lemma 3 implies that /(r, c) = 0(1) as d ^ oo. This concludes the 
proof of Theorem 2. □ 

Finally, before providing the proof of the pending results, it is worth 
discussing the practical implications of the previous bounds. The previous 
results imply a bound on the coefficient of variation that involves an expo- 
nential function to a power that depends on the maximum degree of the row 
sums. In practical situations, this bound can quickly become large, so the 
bounds given here, although computable, may be far too pessimistic in prac- 
tical applications. Improving these bounds is particularly interesting given 
that empirically according to Chen et al. (2005), the estimated coefficient of 
variation of the estimator given by Algorithm 1 is consistently small (they 
report values that are even less than 1). The key issue involves controlling 
the behavior of the row sums during the course of the algorithm under Q{-). 
The techniques here can be adapted to deal with situations when the row 
sums may grow and this will be illustrated elsewhere in the future. 

Proof of Lemma 3. We have that 

(2) (2) (2) (1) _ (2) (1) 

yk+l,n _ yk,n _ yk+l,nyk,n yk,nyk+l,n 

(1) ~ (1) ~ (1) (1) 

yk+l,n yk,n yk+l,nyk,n 

Now 

J2) _ „,(2) (1) _/ (2)_ 2 _ ,,(2) . (1) _ ^ 

yk+l,nyk,n yk,nyk+l,n ~ \yk,n •^k+l,n)yk,n yk,n\yk,n ■^k+l,n) 

_ I (2) (l)^ 

— Xk+l,n\.yk,n ~ ^k+l,nyk,n)- 
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The result then fohows from the fact that 

n n 
(2) _ 2 ^ _ (1) 

yk,n ~ ^j,n — Xk+l,n Xj^n — Xk+l,nyk,n- 

j=k+l j=k+l 

For part (ii) we note that, by assumption there exists a > such that y^^^ < 
a}-/'^yQ^^. Using Cauchy-Schwarz inequahty and part (i) it fohows that 

which imphes y\ l^<a{n — k). Finahy, combining part (i) and the assumption 
that yl^n/yo^n = we can write 

1/2 

Xj,ri ^ ^1/2 Xj,n _ ^ 

fj — 1,71 yj — l,n ■' J 1 ' ji , ' ji 

Now, it follows that 
We conclude that 



which yields (15). 

For part (iii), we use (14) and (15). In particular, we have that 

and, therefore, 

n ^yPo /n-l ^ \ 

which yields (16). □ 



Proof of Proposition 1. Define t^q = inf{A: > 0:8^ < do} and let 
= <^(Sj - <j < k) be the cj-field generated by the process S up to time 
k. As in Section 3, we let r be the first time k <n for which the number 
of strictly positive components of the vector is less than c^+i (and set 
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Cn+i = !)• Note that (using the notation a Ab for the minimum between a 
and b) 

Clearly, the dynamics of Q{-) imply 

5(SrdoAr,/5rdoAr)l(rdo > t) = 0, 

therefore, 

5(so,Po) = El,, ( n ^^§^9{Sr,,Pr,);r, < 
(21) ' ' 

<( sup 9{so,Po))Eirf[ 

Now, define the stochastic process (Z^ : A; > 0) via 

and consider the stopped process Mk = -^fcArdg • Note that {Mk : A; > 0) is a 
nonnegative supermartingale, that is, 

= E{Mk+i;Td, > k\J^k) + E{Mk+i;Td, < k\Tk) 
= l(r,, >k)l[ ^:^linlE{f{Sk+i,Pk+i)\Sk) 



1 

„,.2 



,=0 ^ y^j^Pj) 



j=0 

"'''0-1 ,,2/ 



+i(^<^o<^) n ^27|^/(s..o'^^+i) 
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Therefore 



/(so,Po)>^s'^0 




j=0 



W 



V 




) 




This estimate, together with (21), imphes the conclusion of the proposition. 
□ 
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